About the project

I heard about this R course’s usefulness in handling research data quite efficiently so my interest was guaranteed, although it takes some time to get more familiar with this programming language. The idea of this course’s validness for me was through an acquintance.

Aim of the course

Looking forward to challenge myself with this totally new thing, and finding some tools for my research data analyzes. More about R Studio you can find from here

Destination of this text

You can find my project here: https://github.com/GH-Club/IODS-project


Exercise 2

1. Reading the data and exploring the structure

students2014 <- read.csv(file = "/Users/streetman/IODS-project/data/lrn14.csv", header = T, sep=",")
students2014 <- students2014[,-1]
str(students2014)
## 'data.frame':    166 obs. of  7 variables:
##  $ gender  : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
##  $ Age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ Attitude: int  37 31 25 35 37 38 35 29 38 21 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ Points  : int  25 12 24 10 22 21 21 31 24 26 ...
dim(students2014)
## [1] 166   7

Commentary

The data is a subset from the main data, which includes different questions to students and also some demographical variables for the purpose of doing a survey of approaches to learning. This subdata consists 166 rows as observations and 7 columns as variables: gender, Age, Attitude, deep (deep learning related questions), stra (strategic learning related questions), surf (surface learning related questions).

2. Graphical overview

library(GGally)
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
ggpairs(students2014, mapping = aes(col=gender, alpha=0.3), lower = list(combo = wrap("facethist", bins = 20)))

Commentary

Based on visualization we can detect for example that respondends were almost twice as much females (N=110) and their age varied between 17 and 55, and mean age was 25. Regarding correlation the strongest is between attitude and points and lowest between deep and points.

3. Regression model and fitting variables

## Fit a linear model
model_fit <- lm(Points ~ Attitude + stra + surf, data = students2014)

## Print out a summary of the model
summary (model_fit)
## 
## Call:
## lm(formula = Points ~ Attitude + stra + surf, data = students2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.1550  -3.4346   0.5156   3.6401  10.8952 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.01711    3.68375   2.991  0.00322 ** 
## Attitude     0.33952    0.05741   5.913 1.93e-08 ***
## stra         0.85313    0.54159   1.575  0.11716    
## surf        -0.58607    0.80138  -0.731  0.46563    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared:  0.2074, Adjusted R-squared:  0.1927 
## F-statistic: 14.13 on 3 and 162 DF,  p-value: 3.156e-08
## Fit a linear model
model_fit <- lm(Points ~ Attitude + stra, data = students2014)

## Print out a summary of the model
summary (model_fit)
## 
## Call:
## lm(formula = Points ~ Attitude + stra, data = students2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.6436  -3.3113   0.5575   3.7928  10.9295 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  8.97290    2.39591   3.745  0.00025 ***
## Attitude     0.34658    0.05652   6.132 6.31e-09 ***
## stra         0.91365    0.53447   1.709  0.08927 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.289 on 163 degrees of freedom
## Multiple R-squared:  0.2048, Adjusted R-squared:  0.1951 
## F-statistic: 20.99 on 2 and 163 DF,  p-value: 7.734e-09
## Fit a linear model
model_fit <- lm(Points ~ Attitude, data = students2014)

## Print out a summary of the model
summary (model_fit)
## 
## Call:
## lm(formula = Points ~ Attitude, data = students2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.63715    1.83035   6.358 1.95e-09 ***
## Attitude     0.35255    0.05674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

Commentary

The target variable is Points. For the chosen explanatory variables Attitude, stra and surf: with stra and surf there is no statistically significant relationship with the target variable (p=0.117 and 0.466), therefore surf was removed.

After that, with variable stra, there is no statistically significant relationship with the target variable (p=0.089), therefore it was removed.

After that, with variable attitude, there is statistically significant correlation (p = <0.000) to points with an explanatory value of 0.1906 (mult. R-squared) which mean that this variable explains approx. 19% of the points achieved by students.

4. Diagnostic plots (Residuals vs. Fitted values, Normal Q-Q plot and Residuals vs Leverage)

my_model2 <- lm(Points ~ Attitude, data = students2014)

plot(my_model2, which = c(1,2,5))

Commentary

The residuals are the differences between the predicted and observed value in a given point. “Residuals vs. Fitted”: assumption that errors don’t depend on variable attitude is valid since the plots are evenly spread without any specific pattern. “Normal Q-Q”: assumption is that the residuals are normally distributed and this supports it. “Residuals vs. Leverage”: in general observations are not having unusually high impact.


Exercise 3

1. Getting and reading the .csv file

getwd()
## [1] "/Users/streetman/IODS-project"
alc <- read.csv(file = "/Users/streetman/IODS-project/data/alc.csv", header = T, sep=",")
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:GGally':
## 
##     nasa
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
colnames(alc)
##  [1] "X"          "school"     "sex"        "age"        "address"   
##  [6] "famsize"    "Pstatus"    "Medu"       "Fedu"       "Mjob"      
## [11] "Fjob"       "reason"     "nursery"    "internet"   "guardian"  
## [16] "traveltime" "studytime"  "failures"   "schoolsup"  "famsup"    
## [21] "paid"       "activities" "higher"     "romantic"   "famrel"    
## [26] "freetime"   "goout"      "Dalc"       "Walc"       "health"    
## [31] "absences"   "G1"         "G2"         "G3"         "alc_use"   
## [36] "high_use"

Commentary

The dataset includes the variables above indicating the questions to students relating to their alcohol usage

2. Relationships between high and low alcohol consumption, and choosing the parameters for this exploration

Choosing the parameters

library(tidyr)
glimpse(alc)
## Observations: 382
## Variables: 36
## $ X          <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16…
## $ school     <fct> GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, G…
## $ sex        <fct> F, F, F, F, F, M, M, F, M, M, F, F, M, M, M, F, F, F,…
## $ age        <int> 18, 17, 15, 15, 16, 16, 16, 17, 15, 15, 15, 15, 15, 1…
## $ address    <fct> U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U,…
## $ famsize    <fct> GT3, GT3, LE3, GT3, GT3, LE3, LE3, GT3, LE3, GT3, GT3…
## $ Pstatus    <fct> A, T, T, T, T, T, T, A, A, T, T, T, T, T, A, T, T, T,…
## $ Medu       <int> 4, 1, 1, 4, 3, 4, 2, 4, 3, 3, 4, 2, 4, 4, 2, 4, 4, 3,…
## $ Fedu       <int> 4, 1, 1, 2, 3, 3, 2, 4, 2, 4, 4, 1, 4, 3, 2, 4, 4, 3,…
## $ Mjob       <fct> at_home, at_home, at_home, health, other, services, o…
## $ Fjob       <fct> teacher, other, other, services, other, other, other,…
## $ reason     <fct> course, course, other, home, home, reputation, home, …
## $ nursery    <fct> yes, no, yes, yes, yes, yes, yes, yes, yes, yes, yes,…
## $ internet   <fct> no, yes, yes, yes, no, yes, yes, no, yes, yes, yes, y…
## $ guardian   <fct> mother, father, mother, mother, father, mother, mothe…
## $ traveltime <int> 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 3, 1, 2, 1, 1, 1, 3,…
## $ studytime  <int> 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 3, 1, 2, 3, 1, 3, 2,…
## $ failures   <int> 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ schoolsup  <fct> yes, no, yes, no, no, no, no, yes, no, no, no, no, no…
## $ famsup     <fct> no, yes, no, yes, yes, yes, no, yes, yes, yes, yes, y…
## $ paid       <fct> no, no, yes, yes, yes, yes, no, no, yes, yes, yes, no…
## $ activities <fct> no, no, no, yes, no, yes, no, no, no, yes, no, yes, y…
## $ higher     <fct> yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, yes…
## $ romantic   <fct> no, no, no, yes, no, no, no, no, no, no, no, no, no, …
## $ famrel     <int> 4, 5, 4, 3, 4, 5, 4, 4, 4, 5, 3, 5, 4, 5, 4, 4, 3, 5,…
## $ freetime   <int> 3, 3, 3, 2, 3, 4, 4, 1, 2, 5, 3, 2, 3, 4, 5, 4, 2, 3,…
## $ goout      <int> 4, 3, 2, 2, 2, 2, 4, 4, 2, 1, 3, 2, 3, 3, 2, 4, 3, 2,…
## $ Dalc       <int> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ Walc       <int> 1, 1, 3, 1, 2, 2, 1, 1, 1, 1, 2, 1, 3, 2, 1, 2, 2, 1,…
## $ health     <int> 3, 3, 3, 5, 5, 5, 3, 1, 1, 5, 2, 4, 5, 3, 3, 2, 2, 4,…
## $ absences   <int> 5, 3, 8, 1, 2, 8, 0, 4, 0, 0, 1, 2, 1, 1, 0, 5, 8, 3,…
## $ G1         <int> 2, 7, 10, 14, 8, 14, 12, 8, 16, 13, 12, 10, 13, 11, 1…
## $ G2         <int> 8, 8, 10, 14, 12, 14, 12, 9, 17, 14, 11, 12, 14, 11, …
## $ G3         <int> 8, 8, 11, 14, 12, 14, 12, 10, 18, 14, 12, 12, 13, 12,…
## $ alc_use    <dbl> 1.0, 1.0, 2.5, 1.0, 1.5, 1.5, 1.0, 1.0, 1.0, 1.0, 1.5…
## $ high_use   <lgl> FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE…

Commentary

These four variables were chosen: 1. Age (numeric) 2. Famsize (binary: less or equal to 3, or greater than 3) 3. Failures (numeric: 0-3, or 4 if smth else) 4. Absences (numeric: 0 to 93)

Hypothesis: there is a correlation between the high alcohol consumption and the chosen variables when analyzed in a univariate method to observe correlation H0: there is no correlation between a chosen variable and high alcohol consumption H1: there is significant correlation between a chosen variable and high alcohol consumption

3. Distributions between the chosen variables above and the relations to alcohol consumption

Numerical overview of data

summary(alc)
##        X          school   sex          age        address famsize  
##  Min.   :  1.00   GP:342   F:198   Min.   :15.00   R: 81   GT3:278  
##  1st Qu.: 96.25   MS: 40   M:184   1st Qu.:16.00   U:301   LE3:104  
##  Median :191.50                    Median :17.00                    
##  Mean   :191.50                    Mean   :16.59                    
##  3rd Qu.:286.75                    3rd Qu.:17.00                    
##  Max.   :382.00                    Max.   :22.00                    
##  Pstatus      Medu            Fedu             Mjob           Fjob    
##  A: 38   Min.   :0.000   Min.   :0.000   at_home : 53   at_home : 16  
##  T:344   1st Qu.:2.000   1st Qu.:2.000   health  : 33   health  : 17  
##          Median :3.000   Median :3.000   other   :138   other   :211  
##          Mean   :2.806   Mean   :2.565   services: 96   services:107  
##          3rd Qu.:4.000   3rd Qu.:4.000   teacher : 62   teacher : 31  
##          Max.   :4.000   Max.   :4.000                                
##         reason    nursery   internet    guardian     traveltime   
##  course    :140   no : 72   no : 58   father: 91   Min.   :1.000  
##  home      :110   yes:310   yes:324   mother:275   1st Qu.:1.000  
##  other     : 34                       other : 16   Median :1.000  
##  reputation: 98                                    Mean   :1.448  
##                                                    3rd Qu.:2.000  
##                                                    Max.   :4.000  
##    studytime        failures      schoolsup famsup     paid     activities
##  Min.   :1.000   Min.   :0.0000   no :331   no :144   no :205   no :181   
##  1st Qu.:1.000   1st Qu.:0.0000   yes: 51   yes:238   yes:177   yes:201   
##  Median :2.000   Median :0.0000                                           
##  Mean   :2.037   Mean   :0.2016                                           
##  3rd Qu.:2.000   3rd Qu.:0.0000                                           
##  Max.   :4.000   Max.   :3.0000                                           
##  higher    romantic      famrel         freetime        goout      
##  no : 18   no :261   Min.   :1.000   Min.   :1.00   Min.   :1.000  
##  yes:364   yes:121   1st Qu.:4.000   1st Qu.:3.00   1st Qu.:2.000  
##                      Median :4.000   Median :3.00   Median :3.000  
##                      Mean   :3.937   Mean   :3.22   Mean   :3.113  
##                      3rd Qu.:5.000   3rd Qu.:4.00   3rd Qu.:4.000  
##                      Max.   :5.000   Max.   :5.00   Max.   :5.000  
##       Dalc            Walc           health         absences   
##  Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   : 0.0  
##  1st Qu.:1.000   1st Qu.:1.000   1st Qu.:3.000   1st Qu.: 1.0  
##  Median :1.000   Median :2.000   Median :4.000   Median : 3.0  
##  Mean   :1.482   Mean   :2.296   Mean   :3.573   Mean   : 4.5  
##  3rd Qu.:2.000   3rd Qu.:3.000   3rd Qu.:5.000   3rd Qu.: 6.0  
##  Max.   :5.000   Max.   :5.000   Max.   :5.000   Max.   :45.0  
##        G1              G2              G3           alc_use     
##  Min.   : 2.00   Min.   : 4.00   Min.   : 0.00   Min.   :1.000  
##  1st Qu.:10.00   1st Qu.:10.00   1st Qu.:10.00   1st Qu.:1.000  
##  Median :12.00   Median :12.00   Median :12.00   Median :1.500  
##  Mean   :11.49   Mean   :11.47   Mean   :11.46   Mean   :1.889  
##  3rd Qu.:14.00   3rd Qu.:14.00   3rd Qu.:14.00   3rd Qu.:2.500  
##  Max.   :18.00   Max.   :18.00   Max.   :18.00   Max.   :5.000  
##   high_use      
##  Mode :logical  
##  FALSE:268      
##  TRUE :114      
##                 
##                 
## 

Commentary

For example in variable “absences” the mean absence value is 4.5 and maximum is 45. Range is wide, although 3rd quartile is 6.0, therefore most are within values 0 to 6.0.

Graphical overview of data

Distribution of the chosen variables

library(tidyr); library(dplyr); library(ggplot2)
alc_selected <- select(alc, one_of(c("age", "famsize", "failures", "absences")))
gather(alc_selected) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar()
## Warning: attributes are not identical across measure variables;
## they will be dropped

Commentary

For example in variable “age” we can detect that mainly the participants are below 19 years old.

Variables vs. alcohol usage by histograms

Age

g1 <- ggplot(alc, aes(x = high_use, y = age))
g1 + geom_boxplot(aes(col=sex)) + ylab("age") + ggtitle("Students alcohol consumption and age")

Commentary

Those with high alcohol consumption were mainly male and their age range is wider compared to female within this consumption category.

Famsize

g1 <- ggplot(data = alc, aes(x = alc_use, fill = famsize))
g1 + geom_bar()

Commentary

It seems that more higher the alcohol consumption then more equal are the family sizes. With lower alcohol consumption the family sizes were considerable larger.

Failures

alc$failures <-  as.factor(alc$failures)
g1 <- ggplot(alc, aes(x = high_use, fill = failures))
g1 + geom_bar()

Commentary

Somewhat surpringly, in lower alcohol consumption the amount of failures is higher.

Absences

g1 <- ggplot(alc, aes(x = high_use, y = absences))
g1 + geom_boxplot(aes(col=sex)) + ylab("absences") + ggtitle("Students alcohol consumption and absences")

Commentary

It seems that more the absences then more are is the high alcohol consumption with both sexes.

4. Logistic regression model of the chosen variables

Fitting the LR model

## Fit a linear model
alc$failures <-  as.numeric(alc$failures)
alc$famsize <-  as.numeric(alc$famsize)
model_fit <- lm(high_use ~ age + famsize + failures + absences, data = alc)

## Print out a summary of the model
summary (model_fit)
## 
## Call:
## lm(formula = high_use ~ age + famsize + failures + absences, 
##     data = alc)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.9850 -0.2883 -0.2132  0.5052  0.8587 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.399649   0.327648  -1.220  0.22332    
## age          0.023963   0.019765   1.212  0.22612    
## famsize      0.075069   0.050854   1.476  0.14073    
## failures     0.106463   0.039562   2.691  0.00744 ** 
## absences     0.017151   0.004184   4.099 5.09e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4419 on 377 degrees of freedom
## Multiple R-squared:  0.07954,    Adjusted R-squared:  0.06977 
## F-statistic: 8.144 on 4 and 377 DF,  p-value: 2.625e-06
## Fit a linear model
alc$failures <-  as.numeric(alc$failures)
model_fit <- lm(high_use ~ age + failures + absences, data = alc)

## Print out a summary of the model
summary (model_fit)
## 
## Call:
## lm(formula = high_use ~ age + failures + absences, data = alc)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.0151 -0.2787 -0.2162  0.5361  0.8393 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.308391   0.322265  -0.957  0.33920    
## age          0.024304   0.019795   1.228  0.22029    
## failures     0.104492   0.039601   2.639  0.00867 ** 
## absences     0.017366   0.004188   4.146 4.17e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4426 on 378 degrees of freedom
## Multiple R-squared:  0.07422,    Adjusted R-squared:  0.06687 
## F-statistic:  10.1 on 3 and 378 DF,  p-value: 2.045e-06
## Fit a linear model
alc$failures <-  as.numeric(alc$failures)
alc$famsize <-  as.numeric(alc$famsize)
model_fit <- lm(high_use ~ failures + absences, data = alc)

## Print out a summary of the model
summary (model_fit)
## 
## Call:
## lm(formula = high_use ~ failures + absences, data = alc)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.0035 -0.2801 -0.2127  0.5510  0.8053 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.081634   0.054308   1.503  0.13363    
## failures    0.113115   0.038999   2.900  0.00394 ** 
## absences    0.017973   0.004162   4.319 2.01e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4429 on 379 degrees of freedom
## Multiple R-squared:  0.07052,    Adjusted R-squared:  0.06562 
## F-statistic: 14.38 on 2 and 379 DF,  p-value: 9.573e-07

Commentary

The target variable is high_use (of alcohol). For the chosen explanatory variables age, family size, failures and absences: with age and family size there is no statistically significant relationship with the target variable (p=0.089 and 0.170), therefore family size was removed.

After that, with variable age, there is no statistically significant relationship with the target variable (p=0.089), therefore it was removed.

After that, with variables failures and absences, there is statistically significant correlation (p=0.004 and <0.000) to high_use (of alcohol) with an explanatory value of 0.07 (mult. R-squared) which means that these variables explains approx. 7% of the high_use (of alcohol) among the participants.

Odds rations and CI

### Fitting the logistic regressio model
m <- glm(high_use ~ failures + absences, data = alc, family = "binomial")

### Compute odds ratios (OR)
OR <- coef(m) %>% exp

### Compute confidence intervals (CI)
CI <- confint(m) %>% exp
## Waiting for profiling to be done...
### Printing the output
cbind(OR, CI)
##                    OR      2.5 %    97.5 %
## (Intercept) 0.1528951 0.08788853 0.2604241
## failures    1.6482547 1.14889905 2.3830880
## absences    1.0898257 1.04433337 1.1423489

Commentary

Both variables’, failures and absences, ORs indicate higher odds for high_use (of alcohol) within these variables; yet, the CI crosses value 1 with the variable failures, therefore it seems to inadequate for showing statistical significance in this sense. By this analysis, from the chosen variables, absenses seems to be the one with most statistical significance in its relation to high_use (of alcohol).

5. Predictive power of LR model

## Fit the model
m <- glm(high_use ~ absences, data = alc, family = "binomial")

## Predict() the probability of high_use
probabilities <- predict(m, type = "response")

## Add the predicted probabilities to 'alc'
alc <- mutate(alc, probability = probabilities)

## Use the probabilities to make a prediction of high_use
alc <- mutate(alc, prediction = probability > 0.5)

## Tabulate the target variable versus the predictions
results <- table(high_use = alc$high_use, prediction = alc$prediction)

results
##         prediction
## high_use FALSE TRUE
##    FALSE   263    5
##    TRUE    105    9
tab <- table(high_use = alc$high_use, prediction = alc$prediction) %>% prop.table() %>% addmargins()

tab
##         prediction
## high_use      FALSE       TRUE        Sum
##    FALSE 0.68848168 0.01308901 0.70157068
##    TRUE  0.27486911 0.02356021 0.29842932
##    Sum   0.96335079 0.03664921 1.00000000
## Total proportion of inaccurately classified individuals
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2879581

Commentary

After LR analysis, it was stated that variable “absences” has the most statistical significance it relation to high_use (of alcohol), therefore chosen here to test the predictive power of the LR model.

After calculation, the value of 0.2879 indicates that approx. 29% are incorrect predictions with this model, yet 71% are correct ones, therefore this LR model is useful.


Exercise 4

1. Loading the data and exploring it

install.packages(“MASS”)

install.packages(“corrplot”)

library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
library(dplyr)
library(tidyr)
library(ggplot2)
library(GGally)
library(corrplot)
## corrplot 0.84 loaded

Loading

data("Boston")

Exploring

str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506  14

Commentary

The dataset has 14 columns (variables) and 506 rows (observations). The data is about housing values in Boston suburbs. More about the data variables you can find here

2. Graphical overview and summaries

Summary

summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

Correlations

cor_matrix <- cor(Boston)

cor_matrix %>% round(2)
##          crim    zn indus  chas   nox    rm   age   dis   rad   tax
## crim     1.00 -0.20  0.41 -0.06  0.42 -0.22  0.35 -0.38  0.63  0.58
## zn      -0.20  1.00 -0.53 -0.04 -0.52  0.31 -0.57  0.66 -0.31 -0.31
## indus    0.41 -0.53  1.00  0.06  0.76 -0.39  0.64 -0.71  0.60  0.72
## chas    -0.06 -0.04  0.06  1.00  0.09  0.09  0.09 -0.10 -0.01 -0.04
## nox      0.42 -0.52  0.76  0.09  1.00 -0.30  0.73 -0.77  0.61  0.67
## rm      -0.22  0.31 -0.39  0.09 -0.30  1.00 -0.24  0.21 -0.21 -0.29
## age      0.35 -0.57  0.64  0.09  0.73 -0.24  1.00 -0.75  0.46  0.51
## dis     -0.38  0.66 -0.71 -0.10 -0.77  0.21 -0.75  1.00 -0.49 -0.53
## rad      0.63 -0.31  0.60 -0.01  0.61 -0.21  0.46 -0.49  1.00  0.91
## tax      0.58 -0.31  0.72 -0.04  0.67 -0.29  0.51 -0.53  0.91  1.00
## ptratio  0.29 -0.39  0.38 -0.12  0.19 -0.36  0.26 -0.23  0.46  0.46
## black   -0.39  0.18 -0.36  0.05 -0.38  0.13 -0.27  0.29 -0.44 -0.44
## lstat    0.46 -0.41  0.60 -0.05  0.59 -0.61  0.60 -0.50  0.49  0.54
## medv    -0.39  0.36 -0.48  0.18 -0.43  0.70 -0.38  0.25 -0.38 -0.47
##         ptratio black lstat  medv
## crim       0.29 -0.39  0.46 -0.39
## zn        -0.39  0.18 -0.41  0.36
## indus      0.38 -0.36  0.60 -0.48
## chas      -0.12  0.05 -0.05  0.18
## nox        0.19 -0.38  0.59 -0.43
## rm        -0.36  0.13 -0.61  0.70
## age        0.26 -0.27  0.60 -0.38
## dis       -0.23  0.29 -0.50  0.25
## rad        0.46 -0.44  0.49 -0.38
## tax        0.46 -0.44  0.54 -0.47
## ptratio    1.00 -0.18  0.37 -0.51
## black     -0.18  1.00 -0.37  0.33
## lstat      0.37 -0.37  1.00 -0.74
## medv      -0.51  0.33 -0.74  1.00
"VISUALIZING the correlations"
## [1] "VISUALIZING the correlations"
corrplot(cor_matrix, method="circle", type = "upper", cl.pos = "b", tl.pos = "d", tl.cex = 0.6)

Commentary

Crime rate seems to have strongest positive correlation to property taxation rate and highways accessibility.

3. Standardize

boston_scaled <- scale(Boston)

summary(boston_scaled)
##       crim                 zn               indus        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202  
##       chas              nox                rm               age         
##  Min.   :-0.2723   Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331  
##  1st Qu.:-0.2723   1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366  
##  Median :-0.2723   Median :-0.1441   Median :-0.1084   Median : 0.3171  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.:-0.2723   3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059  
##  Max.   : 3.6648   Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164  
##       dis               rad               tax             ptratio       
##  Min.   :-1.2658   Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047  
##  1st Qu.:-0.8049   1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876  
##  Median :-0.2790   Median :-0.5225   Median :-0.4642   Median : 0.2746  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6617   3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058  
##  Max.   : 3.9566   Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372  
##      black             lstat              medv        
##  Min.   :-3.9033   Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.: 0.2049   1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median : 0.3808   Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4332   3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 0.4406   Max.   : 3.5453   Max.   : 2.9865

Commentary

We scaled the data to get standardization for to be able to compare variables in a better way.

Creating categorical variable

boston_scaled <- as.data.frame(boston_scaled)

bins <- quantile(boston_scaled$crim)

crime <- cut(boston_scaled$crim, breaks = bins, labels = c("low","med_low","med_high","high"), include.lowest = TRUE)

boston_scaled <- dplyr::select(boston_scaled, -crim)

boston_scaled <- data.frame(boston_scaled, crime)

nrow(boston_scaled)
## [1] 506
n <- nrow(boston_scaled)

ind <- sample(n,  size = n * 0.8)

train <- boston_scaled[ind,]

test <- boston_scaled[-ind,]

correct_classes <- test$crime

test <- dplyr::select(test, -crime)

summary(test)
##        zn               indus              chas               nox         
##  Min.   :-0.48724   Min.   :-1.4470   Min.   :-0.27233   Min.   :-1.4040  
##  1st Qu.:-0.48724   1st Qu.:-0.9554   1st Qu.:-0.27233   1st Qu.:-0.9272  
##  Median :-0.48724   Median :-0.4368   Median :-0.27233   Median :-0.3426  
##  Mean   : 0.08004   Mean   :-0.1448   Mean   :-0.04074   Mean   :-0.1354  
##  3rd Qu.: 0.37030   3rd Qu.: 1.0150   3rd Qu.:-0.27233   3rd Qu.: 0.4169  
##  Max.   : 3.37170   Max.   : 2.1155   Max.   : 3.66477   Max.   : 2.7296  
##        rm                age               dis          
##  Min.   :-3.87641   Min.   :-2.0809   Min.   :-1.19192  
##  1st Qu.:-0.56878   1st Qu.:-0.9459   1st Qu.:-0.74287  
##  Median :-0.17098   Median : 0.2886   Median :-0.19537  
##  Mean   :-0.06577   Mean   :-0.0556   Mean   : 0.08213  
##  3rd Qu.: 0.37768   3rd Qu.: 0.8926   3rd Qu.: 0.79494  
##  Max.   : 2.81998   Max.   : 1.1164   Max.   : 3.28405  
##       rad                tax              ptratio        
##  Min.   :-0.98187   Min.   :-1.30676   Min.   :-2.51994  
##  1st Qu.:-0.63733   1st Qu.:-0.78313   1st Qu.:-0.48756  
##  Median :-0.52248   Median :-0.39895   Median : 0.08983  
##  Mean   :-0.08562   Mean   :-0.08546   Mean   : 0.04137  
##  3rd Qu.:-0.17794   3rd Qu.: 0.17066   3rd Qu.: 0.80578  
##  Max.   : 1.65960   Max.   : 1.52941   Max.   : 1.63721  
##      black               lstat               medv         
##  Min.   :-3.903331   Min.   :-1.50301   Min.   :-1.68888  
##  1st Qu.: 0.207114   1st Qu.:-0.79478   1st Qu.:-0.58799  
##  Median : 0.393297   Median :-0.33161   Median :-0.22646  
##  Mean   : 0.007433   Mean   :-0.02143   Mean   :-0.01273  
##  3rd Qu.: 0.440616   3rd Qu.: 0.42213   3rd Qu.: 0.28457  
##  Max.   : 0.440616   Max.   : 3.40663   Max.   : 2.98650

Commentary

We made a categorical variable of crime rate with the division into quantiles indicated above. We also made datasets (“training” and “test”) for testing and implementing the LDA (linear discriminant analysis) model.

4. Fit the linear

lda.fit <- lda(crime ~ ., data = train)

lda.fit
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2400990 0.2500000 0.2450495 0.2648515 
## 
## Group means:
##                   zn      indus        chas        nox         rm
## low       0.93521987 -0.8724530 -0.10997442 -0.8493333  0.5037053
## med_low  -0.09731296 -0.2386718 -0.03844192 -0.5495981 -0.1647176
## med_high -0.37290116  0.1482665  0.20489520  0.4004736  0.1364623
## high     -0.48724019  1.0170108 -0.01476176  1.0473201 -0.3647096
##                 age        dis        rad        tax     ptratio
## low      -0.8456021  0.8258700 -0.6965304 -0.7626577 -0.45088994
## med_low  -0.3702610  0.3890044 -0.5497747 -0.4593960 -0.05857918
## med_high  0.4097210 -0.3830596 -0.4401196 -0.3332586 -0.38631073
## high      0.7899875 -0.8397445  1.6392096  1.5148289  0.78203563
##                black       lstat        medv
## low       0.38099714 -0.78077152  0.56362182
## med_low   0.34657072 -0.08564246 -0.03564772
## med_high  0.06246724 -0.03002024  0.19741850
## high     -0.73740944  0.83685068 -0.64781696
## 
## Coefficients of linear discriminants:
##                 LD1         LD2         LD3
## zn       0.07058871  0.72066372 -0.98831510
## indus    0.04180128 -0.07211078  0.39155827
## chas    -0.13052780 -0.05246700  0.05704959
## nox      0.34196225 -0.79819437 -1.09728588
## rm      -0.09005854 -0.08489643 -0.16074014
## age      0.21827904 -0.30988454 -0.24464117
## dis     -0.03469009 -0.17988822  0.40345681
## rad      3.72467423  0.91409926 -0.21900898
## tax      0.06866486 -0.06288839  0.65027508
## ptratio  0.11073825  0.07609792 -0.26913691
## black   -0.09532881  0.09467939  0.20523587
## lstat    0.22226254 -0.19752611  0.49478322
## medv     0.19276678 -0.36854075 -0.08056736
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9611 0.0273 0.0116
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

classes <- as.numeric(train$crime)

plot(lda.fit, dimen = 2, col=classes, pch=classes)
lda.arrows(lda.fit, myscale = 1)

Commentary

Here we tested the LDA model to the “training” dataset, and by visualization we detect distintiveness between crime rate groups with some amount of overlapping between them.

5. Predicting the classes and cross tabulation

lda.pred <- predict(lda.fit, newdata = test)

table(correct = correct_classes, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       20       9        1    0
##   med_low    5      13        7    0
##   med_high   1       9       14    3
##   high       0       0        0   20

Commentary

Here we implemented the LDA to “test” dataset. It indicates that higher crime values are better predicted than lower ones. As a result, not the optimal prediction model, but fairly good.

6. Reload and standardize

data(Boston)

boston_scaled_2 <- scale(Boston)

str(boston_scaled_2)
##  num [1:506, 1:14] -0.419 -0.417 -0.417 -0.416 -0.412 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:506] "1" "2" "3" "4" ...
##   ..$ : chr [1:14] "crim" "zn" "indus" "chas" ...
##  - attr(*, "scaled:center")= Named num [1:14] 3.6135 11.3636 11.1368 0.0692 0.5547 ...
##   ..- attr(*, "names")= chr [1:14] "crim" "zn" "indus" "chas" ...
##  - attr(*, "scaled:scale")= Named num [1:14] 8.602 23.322 6.86 0.254 0.116 ...
##   ..- attr(*, "names")= chr [1:14] "crim" "zn" "indus" "chas" ...
dim(boston_scaled_2)
## [1] 506  14

Distance matrix

dist_eu <- dist(boston_scaled_2)
summary(dist_eu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4625  4.8241  4.9111  6.1863 14.3970

Distance matrix (Manhattan method)

dist_man <- dist(boston_scaled_2, method = "manhattan")

summary(dist_eu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4625  4.8241  4.9111  6.1863 14.3970

Clustering

km <- kmeans(boston_scaled_2, centers = 3)

pairs(boston_scaled_2, col = km$cluster)

Determining

k_max <- 10

twcss <- sapply(1:k_max, function(k){kmeans(Boston, k)$tot.withinss})

qplot(x = 1:k_max, y = twcss, geom = 'line')

km <- kmeans(boston_scaled_2, centers = 2)

pairs(boston_scaled_2, col = km$cluster)

Commentary

Here we clustered the data by loading it again and concluded that the it is the most optimally clustered via two clusters as indicated in the graph above where curve changes the most at this point.


Exercise 5

library(data.table)
## 
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
## 
##     between, first, last
library(GGally)
library(corrplot)
library(tidyr)
library(ggplot2)

1. Getting and reading the .csv file

getwd()
## [1] "/Users/streetman/IODS-project"
human <- read.csv(file = "/Users/streetman/IODS-project/data/human.csv", header = T, sep=",")
dim(human)
## [1] 155   9
str(human)
## 'data.frame':    155 obs. of  9 variables:
##  $ X        : Factor w/ 155 levels "Afghanistan",..: 105 6 134 41 101 54 67 149 27 102 ...
##  $ Edu2.FM  : num  97.4 94.3 95 95.5 87.7 96.3 80.5 95.1 100 95 ...
##  $ Labo.FM  : num  0.891 0.819 0.825 0.884 0.829 ...
##  $ Life.Exp : num  81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
##  $ Edu.Exp  : num  17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
##  $ GNI      : Factor w/ 155 levels "1,123","1,228",..: 132 109 124 112 113 111 103 123 108 93 ...
##  $ Mat.Mor  : int  4 6 6 5 6 7 9 28 11 8 ...
##  $ Ado.Birth: num  7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
##  $ Parli.F  : num  39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...

Commentary

We can detect for example expected years of education is 4.10 years at minimum with a median of 8.4 years.

2. Graphical overview

human$GNI <- as.integer(human$GNI)

human_ <- human[,-1]

dim(human_)
## [1] 155   8
str(human_)
## 'data.frame':    155 obs. of  8 variables:
##  $ Edu2.FM  : num  97.4 94.3 95 95.5 87.7 96.3 80.5 95.1 100 95 ...
##  $ Labo.FM  : num  0.891 0.819 0.825 0.884 0.829 ...
##  $ Life.Exp : num  81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
##  $ Edu.Exp  : num  17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
##  $ GNI      : int  132 109 124 112 113 111 103 123 108 93 ...
##  $ Mat.Mor  : int  4 6 6 5 6 7 9 28 11 8 ...
##  $ Ado.Birth: num  7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
##  $ Parli.F  : num  39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
ggpairs(human_)

cor(human_) %>% corrplot(method="circle", type="upper", cl.pos="b", tl.pos="d", tl.cex = 0.6)

3. PCA

pca_human <- prcomp(human_)

biplot(pca_human, choices = 1:2, cex=c(0.8, 1), col=c("grey40", "deeppink2"))

4. Standardize

human_std <- scale(human_)
summary(human_std)
##     Edu2.FM            Labo.FM           Life.Exp          Edu.Exp       
##  Min.   :-1.76122   Min.   :-2.6247   Min.   :-2.7188   Min.   :-2.7378  
##  1st Qu.:-0.91250   1st Qu.:-0.5484   1st Qu.:-0.6425   1st Qu.:-0.6782  
##  Median : 0.03967   Median : 0.2316   Median : 0.3056   Median : 0.1140  
##  Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.96275   3rd Qu.: 0.7350   3rd Qu.: 0.6717   3rd Qu.: 0.7126  
##  Max.   : 1.44288   Max.   : 1.6632   Max.   : 1.4218   Max.   : 2.4730  
##       GNI             Mat.Mor          Ado.Birth          Parli.F       
##  Min.   :-1.7154   Min.   :-0.6992   Min.   :-1.1325   Min.   :-1.8203  
##  1st Qu.:-0.8577   1st Qu.:-0.6496   1st Qu.:-0.8394   1st Qu.:-0.7409  
##  Median : 0.0000   Median :-0.4726   Median :-0.3298   Median :-0.1403  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.8577   3rd Qu.: 0.1932   3rd Qu.: 0.6030   3rd Qu.: 0.6127  
##  Max.   : 1.7154   Max.   : 4.4899   Max.   : 3.8344   Max.   : 3.1850
pca_human <- prcomp(human_std)
biplot(pca_human, choices = 1:2, cex=c(0.8, 1), col=c("grey40", "deeppink2"))

5. Interpretation

6. Loading the Tea dataset

library(FactoMineR)

data(tea)

str(tea)
## 'data.frame':    300 obs. of  36 variables:
##  $ breakfast       : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
##  $ tea.time        : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
##  $ evening         : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
##  $ lunch           : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dinner          : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
##  $ always          : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
##  $ home            : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
##  $ work            : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
##  $ tearoom         : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ friends         : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
##  $ resto           : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
##  $ pub             : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Tea             : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How             : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ sugar           : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ how             : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ where           : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ price           : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
##  $ age             : int  39 45 47 23 48 21 37 36 40 37 ...
##  $ sex             : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
##  $ SPC             : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
##  $ Sport           : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
##  $ age_Q           : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
##  $ frequency       : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
##  $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
##  $ spirituality    : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
##  $ healthy         : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
##  $ diuretic        : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
##  $ friendliness    : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
##  $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ feminine        : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
##  $ sophisticated   : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
##  $ slimming        : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ exciting        : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
##  $ relaxing        : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
##  $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
dim(tea)
## [1] 300  36
library(dplyr)

keep_columns <- c("Tea", "How", "how", "sugar", "where", "lunch")

# select the 'keep_columns' to create a new dataset
tea_time <- dplyr::select(tea, one_of(keep_columns))

# visualize the dataset
gather(tea_time) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar() + theme(axis.text.x =  element_text(angle = 45, hjust = 1, size = 8))

MCA

mca <- MCA(tea_time, graph = FALSE)

summary(mca)
## 
## Call:
## MCA(X = tea_time, graph = FALSE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6
## Variance               0.279   0.261   0.219   0.189   0.177   0.156
## % of var.             15.238  14.232  11.964  10.333   9.667   8.519
## Cumulative % of var.  15.238  29.471  41.435  51.768  61.434  69.953
##                        Dim.7   Dim.8   Dim.9  Dim.10  Dim.11
## Variance               0.144   0.141   0.117   0.087   0.062
## % of var.              7.841   7.705   6.392   4.724   3.385
## Cumulative % of var.  77.794  85.500  91.891  96.615 100.000
## 
## Individuals (the 10 first)
##                       Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3
## 1                  | -0.298  0.106  0.086 | -0.328  0.137  0.105 | -0.327
## 2                  | -0.237  0.067  0.036 | -0.136  0.024  0.012 | -0.695
## 3                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 4                  | -0.530  0.335  0.460 | -0.318  0.129  0.166 |  0.211
## 5                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 6                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 7                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 8                  | -0.237  0.067  0.036 | -0.136  0.024  0.012 | -0.695
## 9                  |  0.143  0.024  0.012 |  0.871  0.969  0.435 | -0.067
## 10                 |  0.476  0.271  0.140 |  0.687  0.604  0.291 | -0.650
##                       ctr   cos2  
## 1                   0.163  0.104 |
## 2                   0.735  0.314 |
## 3                   0.062  0.069 |
## 4                   0.068  0.073 |
## 5                   0.062  0.069 |
## 6                   0.062  0.069 |
## 7                   0.062  0.069 |
## 8                   0.735  0.314 |
## 9                   0.007  0.003 |
## 10                  0.643  0.261 |
## 
## Categories (the 10 first)
##                        Dim.1     ctr    cos2  v.test     Dim.2     ctr
## black              |   0.473   3.288   0.073   4.677 |   0.094   0.139
## Earl Grey          |  -0.264   2.680   0.126  -6.137 |   0.123   0.626
## green              |   0.486   1.547   0.029   2.952 |  -0.933   6.111
## alone              |  -0.018   0.012   0.001  -0.418 |  -0.262   2.841
## lemon              |   0.669   2.938   0.055   4.068 |   0.531   1.979
## milk               |  -0.337   1.420   0.030  -3.002 |   0.272   0.990
## other              |   0.288   0.148   0.003   0.876 |   1.820   6.347
## tea bag            |  -0.608  12.499   0.483 -12.023 |  -0.351   4.459
## tea bag+unpackaged |   0.350   2.289   0.056   4.088 |   1.024  20.968
## unpackaged         |   1.958  27.432   0.523  12.499 |  -1.015   7.898
##                       cos2  v.test     Dim.3     ctr    cos2  v.test  
## black                0.003   0.929 |  -1.081  21.888   0.382 -10.692 |
## Earl Grey            0.027   2.867 |   0.433   9.160   0.338  10.053 |
## green                0.107  -5.669 |  -0.108   0.098   0.001  -0.659 |
## alone                0.127  -6.164 |  -0.113   0.627   0.024  -2.655 |
## lemon                0.035   3.226 |   1.329  14.771   0.218   8.081 |
## milk                 0.020   2.422 |   0.013   0.003   0.000   0.116 |
## other                0.102   5.534 |  -2.524  14.526   0.197  -7.676 |
## tea bag              0.161  -6.941 |  -0.065   0.183   0.006  -1.287 |
## tea bag+unpackaged   0.478  11.956 |   0.019   0.009   0.000   0.226 |
## unpackaged           0.141  -6.482 |   0.257   0.602   0.009   1.640 |
## 
## Categorical variables (eta2)
##                      Dim.1 Dim.2 Dim.3  
## Tea                | 0.126 0.108 0.410 |
## How                | 0.076 0.190 0.394 |
## how                | 0.708 0.522 0.010 |
## sugar              | 0.065 0.001 0.336 |
## where              | 0.702 0.681 0.055 |
## lunch              | 0.000 0.064 0.111 |
plot(mca, invisible=c("ind"), habillage = "quali")